library(mosaic)
library(tidyverse)
library(pander)
library(DT)
library(ggrepel)
library(plotly)
library(dplyr)
library(ggplot2)
library(maps)
library(tmap)
library(leaflet)
library(htmltools)
library(car)
A residual is just the difference between the observed value (\(Y_i\)) and the predicted value (\(\hat{Y_i}\)):
\[r_i = Y_i - \hat{Y_i}\]
Think of it as “how far off was my prediction of that jar of jelly beans?”
When looking at residual plots, you want to see points scattered randomly - like if someone threw a bunch of marbles on the floor (accidentally of course). If you see clear patterns, something might be wrong with your model.
Imagine you’re baking cookies:
The residual is how far off your actual baking time was from the predicted 12 minutes. Sometimes it’s over, sometimes under, and sometimes exactly right (just depends on how burnt you like your cookies, JK).
This helps you see how accurate your recipe’s timing prediction is for each batch of cookies.
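The idea above can be sketched in a few lines of R. This is a minimal illustration only: the `toy` data frame and its `x` and `y` columns are hypothetical values made up for the example, not data from this notebook.

```r
# Hypothetical toy data (assumed for illustration; not from the study)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

fit   <- lm(y ~ x, data = toy)   # fit the regression line
y_hat <- fitted(fit)             # predicted values (the line)
r     <- toy$y - y_hat           # residuals: observed minus predicted

# The hand-computed residuals agree with R's built-in extractor
all.equal(unname(r), unname(residuals(fit)))

# Residual plot: look for a random scatter around the zero line
plot(y_hat, r, xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)
```

Because the model includes an intercept, the residuals also sum to (numerically) zero, so some must be positive and some negative.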
The SSE (sum of squared errors) measures how far the observed values (the dots) deviate from the regression line (the law). It adds up the squared residuals (observed value minus predicted value). This can also be explained as the amount of variability that is NOT explained by the model.
This is calculated by the following formula:
\[SSE = \underbrace{\sum_{i=1}^n}_\text{The sum of} (\underbrace{Y_i}_\text{Observed Value(The Dots)} - \underbrace{\hat{Y_i}}_\text{Predicted Value (The Line)})^2 \]
We want these differences to be small compared to how much your drive times vary overall (SSTO).
Think of predicting how long it takes to drive to work:
Your actual drive times vary (maybe 20, 25, or 30 minutes depending on things like traffic, how fast you drive, who knows?), but your prediction model says it always takes 23 minutes.
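As a rough worked illustration, suppose (purely as an assumption for the arithmetic) that you recorded exactly three drives of 20, 25, and 30 minutes, while the model predicts 23 minutes every time. The SSE would then be:

\[SSE = (20 - 23)^2 + (25 - 23)^2 + (30 - 23)^2 = 9 + 4 + 49 = 62\]

Because each error is squared, the 30-minute drive (7 minutes off) contributes most of the total.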
The SSR (sum of squares regression) measures how much the regression line (the law) departs from the average y-value (overall mean). This can also be explained as the amount of variability EXPLAINED by the model, shown by how far our predicted y-values deviate from the overall mean.
This can be calculated by the following formula:
\[SSR = \underbrace{\sum_{i = 1}^n}_\text{The sum of} (\underbrace{\hat{Y_i}}_\text{Predicted Y (The Line)} - \underbrace{\bar{Y}}_\text{Average Y (Overall Mean)})^2\]
SSR matters because it tells us how much our model improves on simply guessing the overall mean.
- It shows how much of what we’re trying to predict can actually be explained by our model
- A larger SSR relative to SSTO means our predictions are more reliable and useful
- It helps us decide if our prediction method is worth using
Imagine predicting pizza delivery times:
The delivery app says:
SSR measures how much these categories actually help EXPLAIN delivery times. For example:
Just like you can’t have “negative accuracy” in predictions, SSR can’t be negative. The bigger the SSR compared to total variation (SSTO), the better your prediction model is working.
The SSTO (total sum of squares) measures how much the y-values depart from the average y-value. This can also be explained as the total variability in Y.
Key Relationship: SSTO = SSR + SSE -> Total Variation = Explained Variation + Unexplained Variation
This is calculated by the following:
\[SSR + SSE = SSTO = \underbrace{\sum_{i=1}^n}_\text{The sum of} (\underbrace{Y_i}_\text{Observed Y Values (The Dots)} - \underbrace{\bar{Y}}_\text{Average Y (Overall Mean)})^2\]
The total variation (SSTO) helps us to know if our predictions are actually useful or just lucky guesses!
Imagine you own a coffee shop and want to understand your daily sales patterns:
Total Variation (SSTO):
This total variation can be broken into two parts:
The better your prediction model, the more of your total variation (SSTO) is explained by your model (SSR), and the less remains unexplained (SSE).
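The split of total variation into explained and unexplained parts can be checked numerically. The sketch below uses a hypothetical `toy` data frame (the `temp` and `sales` values are invented for illustration and are not from this notebook) and confirms that SSTO equals SSR plus SSE for a fitted simple linear regression.

```r
# Hypothetical coffee shop data (assumed for illustration only)
toy <- data.frame(
  temp  = c(30, 40, 50, 60, 70, 80),        # outside temperature
  sales = c(520, 480, 455, 410, 400, 350)   # daily sales
)

fit   <- lm(sales ~ temp, data = toy)
y_bar <- mean(toy$sales)

SSTO <- sum((toy$sales - y_bar)^2)        # total variation
SSR  <- sum((fitted(fit) - y_bar)^2)      # explained by the line
SSE  <- sum((toy$sales - fitted(fit))^2)  # left unexplained

c(SSTO = SSTO, SSR = SSR, SSE = SSE)
all.equal(SSTO, SSR + SSE)   # the decomposition holds
```

The identity relies on the model being fit by least squares with an intercept, which is what `lm()` does by default.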
The R-squared is the proportion of variability in Y that can be explained by the regression.
Definition breakdown:
\[R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} \]
It tells us how much of the variation in Y our model accounts for, which indicates how much confidence we can place in our predictions!
R-squared VS P-value
We can further understand R-squared by how it differs from the p-value for slope:
Imagine you’re analyzing how a cat’s playtime affects their sleep duration:
This demonstrates the key concepts from the definition: proportion (80% explained), variability (fluctuating sleep patterns), and what can be explained by the regression (playtime’s effect).
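Both forms of the R-squared formula, and the contrast with the slope's p-value, can be sketched in R. The `toy` data frame below (cat `playtime` and `sleep` values) is hypothetical and only meant to mirror the example; it is not data from this notebook.

```r
# Hypothetical cat data (assumed for illustration only)
toy <- data.frame(
  playtime = c(10, 20, 30, 40, 50, 60),              # minutes of play
  sleep    = c(12.1, 12.9, 13.4, 14.2, 14.6, 15.5)   # hours of sleep
)

fit   <- lm(sleep ~ playtime, data = toy)
y_bar <- mean(toy$sleep)

SSE  <- sum(residuals(fit)^2)
SSR  <- sum((fitted(fit) - y_bar)^2)
SSTO <- sum((toy$sleep - y_bar)^2)

r2_from_ssr <- SSR / SSTO              # explained share
r2_from_sse <- 1 - SSE / SSTO          # 1 minus unexplained share
r2_reported <- summary(fit)$r.squared  # what summary() prints

# The p-value answers a different question: is the slope different from 0?
p_slope <- coef(summary(fit))["playtime", "Pr(>|t|)"]

c(r2_from_ssr, r2_from_sse, r2_reported, p_slope)
```

The three R-squared values agree, while the p-value is a separate quantity: R-squared describes how much variability is explained, and the p-value describes the evidence that a relationship exists at all.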
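The MSE (mean squared error) measures the average squared difference between predicted and actual values, where the average divides the SSE by the degrees of freedom \(n - p\) (\(n\) observations minus \(p\) estimated parameters; \(p = 2\) for simple linear regression).
- Can be any non-negative number (0 to infinity)
- Units are squared units of the original data (e.g., degrees Fahrenheit²)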
\[MSE = \frac{SSE}{n-p}\]
Relationship to R-squared
| MSE | R-Squared |
|---|---|
| measures squared prediction error | measures proportion of variance in y explained |
| between 0 and infinity | between 0 and 1 (0% - 100%) |
| units are squared units of the original data | unitless |
The Residual Standard Error (RSE) is the square root of the MSE.
- Found in R regression summary output
- Uses the same units as the original data (e.g., degrees Fahrenheit)
\[RSE = \sqrt{MSE} = \sqrt{\frac{SSE}{n-p}}\]
Together, they indicate the fit of our model:
Think of predicting daily temperatures:
The MSE would be like measuring how far off your temperature predictions are on average, but with the errors squared:
- If you predict 75°F and it’s actually 73°F, that’s a difference of 2°F, which gets squared to 4°F²
- The MSE would be the average of all these squared differences
The Residual Standard Error (RSE) would convert this back to the original temperature units by taking the square root:
- So instead of 4°F², you’d get back to a value in °F, making it more intuitive to understand how far off your predictions typically are
Lower values in both cases would mean your temperature predictions are more accurate, aka you’re better at forecasting the actual temperatures that occur (like a psychic)!
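A short R sketch of the MSE and RSE formulas follows. The `toy` data frame (`forecast` and `actual` temperatures) is hypothetical and invented for this illustration; the only claim is that dividing the SSE by \(n - p\) and taking the square root reproduces the residual standard error that `summary()` reports.

```r
# Hypothetical temperature data (assumed for illustration only)
toy <- data.frame(
  forecast = c(60, 65, 70, 75, 80, 85),
  actual   = c(62, 64, 71, 73, 82, 84)
)

fit <- lm(actual ~ forecast, data = toy)

n   <- nrow(toy)           # number of observations
p   <- length(coef(fit))   # estimated parameters (intercept + slope = 2)
SSE <- sum(residuals(fit)^2)

MSE <- SSE / (n - p)       # squared units (°F²)
RSE <- sqrt(MSE)           # original units (°F)

all.equal(RSE, summary(fit)$sigma)   # matches the summary output
```

Dividing by \(n - p\) rather than \(n\) accounts for the two parameters estimated from the data, which is why the result lines up with the "Residual Std. Error" in the regression summary.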
For this study, we were tasked with predicting the “Actual Maximum Air Temperature” for this coming Monday, January 13th at BYU-Idaho. BYU-Idaho is located in the city of Rexburg, Idaho, and thus we will use this city’s weather recordings from timeanddate.com to make our predictions.
janweather <- read.csv("C:/Users/paige/OneDrive/Documents/Fall Semester 2024/MATH 325/Statistics-Notebook-master/Data/JanWeather.csv")
prediction <- data.frame(
STARTMAXTEMP=16,
MAXTEMP= 26,
label = "Prediction Point : 26°F"
)
janweathery_plot <- ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
aes(
text = paste(
"Date:", DATE, "<br>",
"Start Max Temp. of the Day:", STARTMAXTEMP, "\u00b0F<br>",
"Max Temp. of the Day:", MAXTEMP, "\u00b0F"
)
),
size = 2,
color = "darkblue"
) +
geom_smooth(method = "lm", formula= y~x, se = FALSE, color = "dodgerblue") +
labs(
title = "Weather Patterns from Past January 13ths",
x = "Max Start Temperature of the Day (\u00b0F)",
y = "Max Temperature of the Day (\u00b0F)"
) +
geom_point(data=prediction,
aes(x=STARTMAXTEMP, y=MAXTEMP),
size = 3,
color= "red") +
geom_text(
data = prediction,
aes(x = STARTMAXTEMP, y=MAXTEMP, label = label),
nudge_x = -7,
nudge_y = 3.6,
color= "red",
size = 3
) +
theme_minimal()
ggplotly(janweathery_plot, tooltip = "text")
This is our mathematical model:
\[\underbrace{Y_i}_\text{MAXTEMP} = \overbrace{\beta_0}^\text{Intercept} + \overbrace{\beta_1}^\text{Slope} \underbrace{X_i}_\text{STARTMAXTEMP} + \epsilon_i \quad \text{where } \epsilon_i \sim N(0,\sigma^2)\]
This is our Simple Linear Regression test:
janlm <- lm(MAXTEMP ~ STARTMAXTEMP, data=janweather)
summary(janlm)%>%
pander()
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 13.68 | 2.583 | 5.297 | 0.001835 |
| STARTMAXTEMP | 0.743 | 0.1214 | 6.119 | 0.0008698 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 8 | 4.275 | 0.8619 | 0.8389 |
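As a quick check on where the red prediction point comes from, the fitted equation can be evaluated by hand. Using the rounded estimates from the table above and the STARTMAXTEMP of 16°F from the `prediction` data frame:

\[\hat{Y} = 13.68 + 0.743 \times 16 = 13.68 + 11.888 \approx 25.57\]

Rounded to the nearest degree, this gives the 26°F shown on the plot. The small difference from an exact `predict()` call would come only from rounding the coefficients.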
We will now use this study to look in more depth at how residuals work.
As a reminder, residuals are the difference between the observed value (\(Y_i\)) and the predicted value (\(\hat{Y_i}\)).
In the context of this study, the residual of a given point is the difference between the observed MAXTEMP and the predicted MAXTEMP. This can be depicted as the following:
\[\text{Residual} = \text{Observed MAXTEMP} - \text{Predicted MAXTEMP}\]
Below is the table of residuals for all 8 of the points used in this data set.
pander(janlm$residuals)
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|
| -3.683 | -2.259 | 4.943 | 3.026 | -6.745 | 3.088 | 1.54 | 0.08834 |
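The printed residuals can be sanity-checked directly. The sketch below simply retypes the eight rounded values from the table above into a vector (the name `res` is introduced here for illustration) and confirms two properties: they sum to roughly zero, and their squares add up to roughly the SSE of 109.66.

```r
# Residuals copied from the table above (already rounded by pander)
res <- c(-3.683, -2.259, 4.943, 3.026, -6.745, 3.088, 1.54, 0.08834)

sum(res)     # close to 0, up to rounding
sum(res^2)   # close to the SSE, up to rounding
```

Small discrepancies are expected because the table shows rounded residuals rather than the full-precision values stored in `janlm$residuals`.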
| Residual Value | Meaning |
|---|---|
| Positive Residual (+) | the predicted MAXTEMP is lower than the observed MAXTEMP (aka an underprediction) |
| Negative Residual (-) | the predicted MAXTEMP is higher than the observed MAXTEMP (aka an overprediction) |
| Close to 0 | the predicted MAXTEMP is very close to the observed MAXTEMP (aka a good fit prediction) |
The graphic below visualizes the residuals as pink vertical segments connecting each observed point (pink dot) to the regression line.
janweather$predicted_MAXTEMP <- predict(janlm)
janweather$residuals <- janweather$MAXTEMP - janweather$predicted_MAXTEMP
ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
size = 2,
color = "pink"
) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") + # Regression line
# Add vertical lines representing residuals
geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = predicted_MAXTEMP, yend = MAXTEMP),
color = "pink", linetype = "solid", size = 0.8) + # Residuals (error lines)
labs(title = "Residuals of Weather Prediction Analysis") +
theme_minimal()
The SSE, SSR, and SSTO values for this study are computed below:
janweather$predicted_MAXTEMP <- predict(janlm)
janweather$residuals <- janweather$MAXTEMP - janweather$predicted_MAXTEMP
SSTO <- sum((janweather$MAXTEMP - mean(janweather$MAXTEMP))^2)
SSR <- sum((janweather$predicted_MAXTEMP - mean(janweather$MAXTEMP))^2)
SSE <- sum(janweather$residuals^2)
# cat() prints directly, so no pander() wrapper is needed
cat("SSE:", round(SSE, 2), "\n")
SSE: 109.66
cat("SSR:", round(SSR, 2), "\n")
SSR: 684.34
cat("SSTO:", round(SSTO, 2), "\n")
SSTO: 794
Here is how these concepts apply:
| Concept | Meaning | Application |
|---|---|---|
| Sum of Squared Errors (SSE) | measures the unexplained variation in the data | How much of the variation in MAXTEMP is not explained by the relationship with STARTMAXTEMP. We want our SSE to be small relative to our SSTO, since that means little variability is left unexplained. With an SSE of 109.66 out of an SSTO of 794, our model is a good fit and doesn’t leave a lot of unexplained variability. |
| Sum of Squares Regression (SSR) | measures the explained variation in the data | How much of the variation in MAXTEMP is explained by the relationship with STARTMAXTEMP. We want our SSR to be large relative to our SSTO. With an SSR of 684.34 out of an SSTO of 794, our model does a good job of explaining the variability of MAXTEMP and is a good fit for our data. |
| Total Sum of Squares (SSTO) | measures the total variation in the data, combining the explained and unexplained parts | The total variability in MAXTEMP. |
mean_MAXTEMP <- mean(janweather$MAXTEMP)
ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
size = 2,
color = "grey"
) +
geom_smooth(method = "lm", formula= y~x, se = FALSE, color = "black") +
geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = predicted_MAXTEMP, yend = mean_MAXTEMP),
color = "green", linetype = "dashed", size= .8) +
geom_hline(yintercept = mean_MAXTEMP, color = "grey", linetype = "solid", size = 0.8) +
labs(title = "SSR of Weather Prediction Analysis") +
theme_minimal()
ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
size = 2,
color = "grey"
) +
geom_smooth(method = "lm", formula= y~x, se = FALSE, color = "black") +
geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = MAXTEMP, yend = predicted_MAXTEMP),
color = "red", linetype = "dotted", size = 1) +
geom_hline(yintercept = mean_MAXTEMP, color = "grey", linetype = "solid", size = 0.8) +
labs(title = "SSE of Weather Prediction Analysis") +
theme_minimal()
ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
size = 2,
color = "grey"
)+
geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = MAXTEMP, yend = mean_MAXTEMP),
color = "blue", linetype = "dotted", size = 1) +
geom_hline(yintercept = mean_MAXTEMP, color = "blue", linetype = "dotted", size = 1) +
labs(title = "SSTO of Weather Prediction Analysis") +
theme_minimal()
ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
size = 2,
color = "grey"
) +
geom_smooth(method = "lm", formula= y~x, se = FALSE, color = "lightblue") +
geom_segment(aes(x = STARTMAXTEMP + 0.1, xend = STARTMAXTEMP + 0.1, y = predicted_MAXTEMP, yend = mean_MAXTEMP),
color = "green", linetype = "dashed", size = .8) +
geom_segment(aes(x = STARTMAXTEMP + 0.2, xend = STARTMAXTEMP + 0.2, y = MAXTEMP, yend = predicted_MAXTEMP),
color = "red", linetype = "dotted", size = 1) +
geom_segment(aes(x = STARTMAXTEMP + 0.3, xend = STARTMAXTEMP + 0.3, y = MAXTEMP, yend = mean_MAXTEMP),
color = "blue", linetype = "dotted", size = 1) +
geom_hline(yintercept = mean_MAXTEMP, color = "blue", linetype = "dotted", size = 2) +
geom_hline(yintercept = mean_MAXTEMP, color = "grey", linetype = "solid", size = 0.8) +
labs(title = "SSR, SSE, and SSTO of Weather Prediction Analysis") +
theme_minimal()
In this study, R Squared explains how well our independent
variable, STARTMAXTEMP, explains/predicts the variability
in our dependent variable, MAXTEMP.
You can find our R-squared value either by computing it with the equation below or by looking at our Simple Linear Regression test output under \(R^2\).
\[R^2 = \frac{SSR}{SSTO} = \frac{684.34}{794} = 0.8619 \]
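The same value follows from the other form of the formula, using the SSE of 109.66 reported earlier:

\[R^2 = 1 - \frac{SSE}{SSTO} = 1 - \frac{109.66}{794} \approx 1 - 0.1381 = 0.8619\]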
janlm <- lm(MAXTEMP ~ STARTMAXTEMP, data=janweather)
summary(janlm)%>%
pander()
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 13.68 | 2.583 | 5.297 | 0.001835 |
| STARTMAXTEMP | 0.743 | 0.1214 | 6.119 | 0.0008698 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 8 | 4.275 | 0.8619 | 0.8389 |
With this value, we can interpret our 0.8619 \(R^2\) value with the following table:
| \(R^2\) Value | Interpretation |
|---|---|
| 1 | Perfect fit, perfectly explains the variability in MAXTEMP using STARTMAXTEMP |
| around 0 | Not a good fit, does not explain ANY variability in MAXTEMP and there is no linear relationship between the two variables |
| between 0 and 1 | The proportion of the variability in MAXTEMP that can be explained with the STARTMAXTEMP; our value of 0.8619 means about 86% is explained |
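Below is a graph that uses red and blue boxes to depict the \(R^2\) calculation: the red boxes are the squared residuals (summing to the SSE), the blue boxes are the squared deviations from the mean (summing to the SSTO), and \(R^2\) is 1 minus the ratio of the total red area to the total blue area.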
# Red squares: squared residuals (total area = SSE)
# Blue squares: squared deviations from the mean (total area = SSTO)
ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "grey") +
  geom_rect(aes(xmin = STARTMAXTEMP, xmax = STARTMAXTEMP + residuals(janlm),
                ymin = MAXTEMP, ymax = fitted(janlm)),
            color = "red", alpha = 0.1) +
  geom_rect(aes(xmin = STARTMAXTEMP, xmax = STARTMAXTEMP + (MAXTEMP - mean(MAXTEMP)),
                ymin = MAXTEMP, ymax = mean(MAXTEMP)),
            color = "blue", alpha = 0.1) +
  labs(title = "Visualizing the R-Squared Calculation: 1 - SSE/SSTO",
       x = "Starting Max Temperature (\u00b0F)", y = "Max Temperature (\u00b0F)") +
  theme_minimal()
Both the MSE and the “Residual Standard Error” help in assessing the accuracy and reliability of our weather prediction model.
- MSE gives us an overall measure of our prediction error, in squared units (°F²)
- Lower MSE: the model is doing well in predicting the Y (MAXTEMP) from the X (STARTMAXTEMP)
- Higher MSE: the model is NOT doing well in predicting the Y (MAXTEMP) from the X (STARTMAXTEMP), as the data does not fit well
- “Residual Standard Error” gives us the typical size of the error in our model’s predictions, in the original units (°F)
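# MSE divides the SSE by the residual degrees of freedom (n - p = 8 - 2 = 6)
MSE <- sum(residuals(janlm)^2) / df.residual(janlm)
rse <- sqrt(MSE)  # same quantity as the Residual Std. Error in summary(janlm)
cat("MSE:", round(MSE, 2), "\n")
MSE: 18.28
cat("RSE:", round(rse, 2), "°F")
RSE: 4.28 °F
With these values we are able to deduce the following:
- MSE: the SSE of 109.66 averaged over the 6 residual degrees of freedom is 18.28 °F²
- RSE: the observed MAXTEMP values typically fall about 4.28°F from the model’s predictions, which matches the Residual Std. Error of 4.275 in the regression summary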
This can be visualized using the graph below: each purple box is a squared residual, with side length equal to that point’s residual, and the dark green box in the left-hand corner has side length equal to the RSE, so its area is the MSE.
ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
size = 2,
color = "purple"
) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") + # Regression line
# Add vertical lines representing residuals
geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = predicted_MAXTEMP, yend = MAXTEMP),
color = "purple", linetype = "solid", size = 0.8) +
# Purple squares: squared residuals
geom_rect(aes(xmin = STARTMAXTEMP, xmax = STARTMAXTEMP + residuals(janlm), ymin = MAXTEMP, ymax = fitted(janlm)), alpha = 0.3, color = "purple") +
# Dark green square: side length = RSE from the model, so its area = MSE
geom_rect(aes(xmin = 1, xmax = 1 + sigma(janlm), ymin = 40, ymax = 40 + sigma(janlm)), color = "darkgreen", alpha = 0.1) +
labs(title = "Residuals of Weather Prediction Analysis") +
theme_minimal()